fix(transaction): detect duplicate file paths within a single FastAppend batch#2509
fix(transaction): detect duplicate file paths within a single FastAppend batch#2509SreeramGarlapati wants to merge 1 commit into
Conversation
xanderbailey
left a comment
There was a problem hiding this comment.
Nice PR, thanks for the contribution, left a couple of comments / questions
| } | ||
| if !duplicates_in_batch.is_empty() { | ||
| let mut paths: Vec<&str> = duplicates_in_batch.into_iter().collect(); | ||
| paths.sort_unstable(); |
There was a problem hiding this comment.
Curious to know if we need the sort?
There was a problem hiding this comment.
moot after blackmwk's fail-fast suggestion (below) - we now report just the first dup, no list to sort.
| // First, detect duplicate file paths within the batch itself. | ||
| // Collecting straight into a `HashSet` would silently deduplicate | ||
| // and pass corruption through to the manifest, so we walk the list | ||
| // explicitly and surface every distinct duplicated path. |
There was a problem hiding this comment.
This comment is maybe a little verbose?
There was a problem hiding this comment.
ya - shouldn't have been there. dropped it entirely; the function name + err message convey it.
blackmwk
left a comment
There was a problem hiding this comment.
Thanks @SreeramGarlapati for this pr!
| let mut new_files: HashSet<&str> = HashSet::with_capacity(self.added_data_files.len()); | ||
| let mut duplicates_in_batch: HashSet<&str> = HashSet::new(); |
There was a problem hiding this comment.
I think this is over complicated, we should throw error immediately when new_files.insert return false.
There was a problem hiding this comment.
Also I prefer to move this check in the begigning of FastAppendAction's commit method, rather than putting it here.
There was a problem hiding this comment.
agreed - simplified to fail-fast on first insert collision. no extra set, no sort.
There was a problem hiding this comment.
done - extracted as a free fn check_no_duplicate_paths_in_batch in transaction::snapshot, called at the top of FastAppendAction::commit under the same check_duplicate gate. SnapshotProducer::validate_duplicate_files is now purely the cross-snapshot check.
6048b31 to
c6f1688
Compare
…end batch
`SnapshotProducer::validate_duplicate_files` collected `added_data_files`
straight into a `HashSet<&str>` before checking against existing manifests.
That collect step silently dedupes the batch, so two `DataFile` entries
sharing the same `file_path` in one `add_data_files(...)` call were written
into the manifest unchecked and committed without error - producing a
snapshot whose `added_files_count` and read-side row count both
double-count the offending file.
Add `check_no_duplicate_paths_in_batch` as a free function in
`transaction::snapshot`, run it at the top of `FastAppendAction::commit`
under the same `check_duplicate` gate, and fail fast on the first
collision with `ErrorKind::DataInvalid` naming the offending path. The
cross-snapshot half of `validate_duplicate_files` is unchanged; it now
builds its own local `new_files` set from `added_data_files` (no
behavioural change).
Two unit tests:
- three identical paths in one batch are rejected with `DataInvalid`
and the offending path in the message;
- `with_check_duplicate(false)` opt-out still accepts batch duplicates,
matching the opt-out semantics already documented for the
cross-snapshot check.
Closes apache#2507.
|
thanks for the reviews @xanderbailey and @blackmwk - addressed all ur comments. |
| } | ||
| } | ||
|
|
||
| pub(super) fn check_no_duplicate_paths_in_batch(files: &[DataFile]) -> Result<()> { |
There was a problem hiding this comment.
Why putting it in SnapshotProducer? I think we should just put it in FastAppendAction.
| #[async_trait] | ||
| impl TransactionAction for FastAppendAction { | ||
| async fn commit(self: Arc<Self>, table: &Table) -> Result<ActionCommit> { | ||
| if self.check_duplicate { |
There was a problem hiding this comment.
I think we should always add this check. The check_duplicate flag is used to save checking against existing data files in snapshot since it may introduce extra io. This is not the case for the newly added data files, and if we don't do the check, it's a bug.
Which issue does this PR close?
What changes are included in this PR?
validate_duplicate_fileswas meant to refuse any append that would point at a file the table already references. It's been doing exactly half of that. The other half — refusing the same file appearing twice inside one batch — has been broken since the function was written, because the very first thing it does is collect the batch into aHashSet. The duplicates are gone before the check looks for them.Concretely: hand
FastAppendActiontwoDataFiles with identicalfile_paths, callcommit()with the defaultcheck_duplicate=true, and you getOk. The resulting manifest holds two entries pointing at the same Parquet file, the snapshot summary'sadded_files_countis wrong, and every scan double-counts the rows. Nothing in the commit path notices.The fix is small. A new free function
check_no_duplicate_paths_in_batchintransaction::snapshotwalks the batch and returnsDataInvalidon the first collidingfile_path.FastAppendAction::commitcalls it at the top, under the samecheck_duplicategate, so the opt-out viawith_check_duplicate(false)still disables both halves of the check together.SnapshotProducer::validate_duplicate_filesis now purely the cross-snapshot check; it builds its own localnew_filesset fromadded_data_filesand is otherwise unchanged.This is a sibling of #1394, which fixed the cross-snapshot half of the same function. Same shape of bug, opposite side of the comparison.
Are these changes tested?
Yes — two unit tests in
crates/iceberg/src/transaction/append.rs.The first throws three identical paths into a single batch and asserts the commit fails with
DataInvalidcarrying the offending path in the message.The second asserts that
with_check_duplicate(false)still accepts batch duplicates — same opt-out behaviour the cross-snapshot check already documents.I also verified the repro by reverting just the
commit()change and rerunning the first test against the unfixed code — it fails exactly where the panic-on-Okbranch lives, which is the bug demonstrated as a failing unit test.